An imbalanced dataset is one in which some labels appear only very rarely. In that situation, plain accuracy is a misleading metric and makes it easy for the model to get lazy: in a binary classification problem where one class appears only 1% of the time, a model that always answers the other class already reaches 99% accuracy.
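As a toy illustration of that 99% trap (hypothetical numbers, not part of the experiment below), a classifier that always answers the majority class looks very accurate while never detecting the rare class:

import numpy as np

y_true = np.array([0] * 990 + [1] * 10)   # class 1 makes up only 1% of the data
y_pred = np.zeros_like(y_true)            # a "lazy" model: always predict class 0

print((y_pred == y_true).mean())          # 0.99 accuracy, yet class 1 is never detected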
In this post we use mnist to experiment with this problem, applying Undersampling to reduce the majority classes and improve accuracy on the minority classes.
Because we deliberately want to shrink some of the mnist classes, we will not use the dataset that tfds provides; instead we download the original mnist files ourselves.
!wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
!wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
!wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
!wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Once the download finishes, decompress and parse the files:
import gzip
import numpy as np

image_size = 28

# Training set: skip the header bytes, then read the raw labels / pixel values.
num_images = 60000
with gzip.open('train-labels-idx1-ubyte.gz') as bytestream:
    bytestream.read(8)                        # 8-byte label-file header
    buf = bytestream.read(1*num_images)
    train_labels = np.frombuffer(buf, dtype=np.uint8).astype(np.int64)
with gzip.open('train-images-idx3-ubyte.gz') as bytestream:
    bytestream.read(16)                       # 16-byte image-file header
    buf = bytestream.read(image_size*image_size*num_images)
    train_images = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
    train_images = train_images.reshape(num_images, image_size, image_size, 1)

# Test set: same format, 10000 samples.
num_images = 10000
with gzip.open('t10k-labels-idx1-ubyte.gz') as bytestream:
    bytestream.read(8)
    buf = bytestream.read(1*num_images)
    test_labels = np.frombuffer(buf, dtype=np.uint8).astype(np.int64)
with gzip.open('t10k-images-idx3-ubyte.gz') as bytestream:
    bytestream.read(16)
    buf = bytestream.read(image_size*image_size*num_images)
    test_images = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
    test_images = test_images.reshape(num_images, image_size, image_size, 1)
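A quick sanity check that the parsing worked (expected shapes shown as comments):

print(train_images.shape, train_labels.shape)  # (60000, 28, 28, 1) (60000,)
print(test_images.shape, test_labels.shape)    # (10000, 28, 28, 1) (10000,)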
For this experiment I picked the digits 6, 8, and 9 as the imbalanced minority classes, because these three digits seem to share some similarities in shape.
First, sort the dataset by label.
idx = np.argsort(train_labels)
train_labels_sorted = train_labels[idx]
train_images_sorted = train_images[idx]
idx = np.argsort(test_labels)
test_labels_sorted = test_labels[idx]
test_images_sorted = test_images[idx]
Check how many samples each class has:
unique, counts = np.unique(train_labels_sorted, return_counts=True)
dict(zip(unique, counts))
Output:
{0: 5923,
1: 6742,
2: 5958,
3: 6131,
4: 5842,
5: 5421,
6: 5918,
7: 6265,
8: 5851,
9: 5949}
Next, keep only 100 samples each for classes 6, 8, and 9.
from sklearn.utils import shuffle  # shuffles images and labels while keeping them aligned

# Keep classes 0-5 and 7 in full, but only the first 100 samples each of 6, 8 and 9.
idx_we_want = (list(range(sum(counts[:6])+100))
               + list(range(sum(counts[:7]), sum(counts[:8])+100))
               + list(range(sum(counts[:9]), sum(counts[:9])+100)))
train_label_imbalanced = train_labels_sorted[idx_we_want]
train_images_imbalanced = train_images_sorted[idx_we_want]
train_images_imbalanced, train_label_imbalanced = shuffle(train_images_imbalanced, train_label_imbalanced)
After that, check the class counts again:
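The check mirrors the earlier np.unique call (a quick sketch, applied to the imbalanced labels):

unique_im, counts_im = np.unique(train_label_imbalanced, return_counts=True)
dict(zip(unique_im, counts_im))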
{0: 5923,
1: 6742,
2: 5958,
3: 6131,
4: 5842,
5: 5421,
6: 100,
7: 6265,
8: 100,
9: 100}
That covers the training set. Since classes 6, 8, and 9 are now rare while the other classes remain plentiful, and we want a clear view of how well the model handles these three digits, we extract exactly those three classes from the test set and use them as the validation data during training.
# The offsets here must come from the *test* set's class counts, not the training counts computed earlier.
_, counts_test = np.unique(test_labels_sorted, return_counts=True)
idx_we_want = list(range(sum(counts_test[:6]), sum(counts_test[:6])+counts_test[6])) + list(range(sum(counts_test[:8]), sum(counts_test[:8])+counts_test[8])) + list(range(sum(counts_test[:9]), sum(counts_test[:9])+counts_test[9]))
test_label_689 = test_labels_sorted[idx_we_want]
test_images_689 = test_images_sorted[idx_we_want]
Class distribution of these three digits in the test set:
{6: 958, 8: 974, 9: 1009}
Good. Now that the data is cleaned up, let's see what kind of problems show up when we train a model under this imbalance.
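The fit calls below consume ds_train_im and ds_test, which are not constructed in this section. Here is a minimal tf.data sketch of how they might be built; the batch size and the [0, 1] pixel scaling are assumptions rather than values taken from the original experiment:

import tensorflow as tf

BATCH_SIZE = 128  # assumed value

ds_train_im = (tf.data.Dataset
               .from_tensor_slices((train_images_imbalanced / 255.0, train_label_imbalanced))
               .shuffle(len(train_label_imbalanced))
               .batch(BATCH_SIZE))

ds_test = (tf.data.Dataset
           .from_tensor_slices((test_images_689 / 255.0, test_label_689))
           .batch(BATCH_SIZE))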
# Train on the imbalanced training set and validate on the 6/8/9 test subset.
# LR and EPOCHS are hyperparameters assumed to be defined earlier in the series
# (the training log below shows EPOCHS = 30).
model = tf.keras.Sequential()
model.add(tf.keras.layers.Conv2D(32, [3, 3], activation='relu', input_shape=(28, 28, 1)))
model.add(tf.keras.layers.Conv2D(64, [3, 3], activation='relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(10))
model.compile(
    optimizer=tf.keras.optimizers.SGD(LR),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)
history = model.fit(
    ds_train_im,
    epochs=EPOCHS,
    validation_data=ds_test,
)
Output:
Epoch 24/30
loss: 0.0195 - sparse_categorical_accuracy: 0.9932 - val_loss: 0.8394 - val_sparse_categorical_accuracy: 0.8089
The accuracy on the test (validation) set comes out at around 80%.
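A single accuracy number hides which of the three minority digits suffer most. A per-class check with scikit-learn's confusion_matrix is one way to see this (a sketch; the /255.0 scaling matches the tf.data sketch above, and since the model outputs logits we take the argmax):

from sklearn.metrics import confusion_matrix

logits = model.predict(test_images_689 / 255.0)   # raw logits from the final Dense(10) layer
preds = np.argmax(logits, axis=1)
print(confusion_matrix(test_label_689, preds, labels=list(range(10))))  # rows = true digit, columns = predicted digit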
Next, we undersample the training set to 100 examples per class, drastically reducing the classes other than 6, 8, and 9 down to the same 100 samples (a sketch of this step follows the counts below):
{0: 100,
1: 100,
2: 100,
3: 100,
4: 100,
5: 100,
6: 100,
7: 100,
8: 100,
9: 100}
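The code for this step is sketched here by reusing the sorted training arrays and the training-set counts from earlier; the *_balanced names and the dataset rebuild are illustrative assumptions:

# Take the first 100 samples of every digit from the sorted training arrays.
class_starts = np.concatenate(([0], np.cumsum(counts)))   # counts = training-set class counts from earlier
idx_100 = np.concatenate([np.arange(class_starts[d], class_starts[d] + 100) for d in range(10)])
train_label_balanced = train_labels_sorted[idx_100]
train_images_balanced = train_images_sorted[idx_100]
train_images_balanced, train_label_balanced = shuffle(train_images_balanced, train_label_balanced)

# Rebuild ds_train_im from the balanced arrays, same recipe as the earlier sketch.
ds_train_im = (tf.data.Dataset
               .from_tensor_slices((train_images_balanced / 255.0, train_label_balanced))
               .shuffle(len(train_label_balanced))
               .batch(BATCH_SIZE))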
# Same architecture and training setup as before, now fitted on the balanced
# (100-per-class) training subset.
model = tf.keras.Sequential()
model.add(tf.keras.layers.Conv2D(32, [3, 3], activation='relu', input_shape=(28, 28, 1)))
model.add(tf.keras.layers.Conv2D(64, [3, 3], activation='relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(10))
model.compile(
    optimizer=tf.keras.optimizers.SGD(LR),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)
history = model.fit(
    ds_train_im,
    epochs=EPOCHS,
    validation_data=ds_test,
)
Output:
Epoch 27/30
loss: 0.1910 - sparse_categorical_accuracy: 0.9370 - val_loss: 0.2793 - val_sparse_categorical_accuracy: 0.9300
The validation accuracy climbs to 93%, so applying Undersampling to this mnist experiment did pay off.
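To confirm that the gain really comes from the minority digits, the same confusion-matrix check as above can be repeated on the retrained model:

logits = model.predict(test_images_689 / 255.0)
print(confusion_matrix(test_label_689, np.argmax(logits, axis=1), labels=list(range(10))))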